The objective here is to predict the RTFP\(^{NA}\) score for the next 10 years for Canada, Mexico, and the USA. This indicator reflects a country's total factor productivity at constant national prices, and the Penn World Table recalculates it every year so that all countries can be compared across the years.
These are the libraries used in this analysis:
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.1 ✓ dplyr 1.0.0
## ✓ tidyr 1.1.0 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(TSstudio)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(xts)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(skimr)
library(tidyquant)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
## Loading required package: PerformanceAnalytics
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(ggtext)
# raw data
pwt_raw <- read_csv("data/TFP.csv")
## Parsed with column specification:
## cols(
## isocode = col_character(),
## year = col_double(),
## rtfpna = col_double()
## )
# first look
glimpse(pwt_raw)
## Rows: 186
## Columns: 3
## $ isocode <chr> "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA", "USA"…
## $ year <dbl> 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1…
## $ rtfpna <dbl> 0.6171479, 0.6295884, 0.6384513, 0.6518582, 0.6461794, 0.6687…
skim(pwt_raw)
| Data summary | |
|---|---|
| Name | pwt_raw |
| Number of rows | 186 |
| Number of columns | 3 |
| Column type frequency: | |
| character | 1 |
| numeric | 2 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| isocode | 0 | 1 | 3 | 3 | 0 | 3 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 1980.50 | 17.94 | 1950.00 | 1965.00 | 1980.5 | 1996.00 | 2011.00 | ▇▇▇▇▇ |
| rtfpna | 0 | 1 | 0.98 | 0.18 | 0.62 | 0.86 | 1.0 | 1.05 | 1.38 | ▃▂▇▂▂ |
The series is annual: each observation corresponds to a different year. The data has no missing values, so no imputation method will be necessary.
pwt_raw %>%
group_by(isocode) %>%
count()
## # A tibble: 3 x 2
## # Groups: isocode [3]
## isocode n
## <chr> <int>
## 1 CAN 62
## 2 MEX 62
## 3 USA 62
One caveat is that the data contains only 186 observations, 62 per country. Since a 10-year forecast was requested, the horizon amounts to about 16% (10/62) of the available history for each series.
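The arithmetic behind that share is simple enough to sanity-check directly (the numbers below just restate the counts above):

```r
# 10-year forecast horizon relative to the 62 yearly observations per country
h <- 10
n_per_country <- 62
share <- round(100 * h / n_per_country, 2)
share  # roughly 16% of the available history
```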
# countries histograms
pwt_raw %>%
mutate(isocode2 = case_when(
isocode == "CAN" ~ "Canada",
isocode == "MEX" ~ "Mexico",
TRUE ~ "United States of America"
)) %>%
ggplot(aes(x = rtfpna, fill = isocode2)) +
geom_histogram(binwidth = .05, color = "white") +
scale_fill_manual(values = c("#1874CD", "#C28E42", "#00688B")) +
facet_wrap(~isocode2) +
theme_tq() +
guides(fill = FALSE) +
labs(
title = "RTFP^NA Histogram for the Countries",
x = "RTFP^NA",
y = "Frequency",
caption = "github.com/LucianoBatista"
) +
theme(plot.title = element_markdown(),
axis.title.x = element_markdown())
The distributions look roughly bell-shaped, but with so few observations per country this is hard to confirm.
# time series visualization
pwt_raw %>%
mutate(isocode2 = case_when(
isocode == "CAN" ~ "Canada",
isocode == "MEX" ~ "Mexico",
TRUE ~ "United States of America"
)) %>%
ggplot(aes(x = year,
y = rtfpna,
color = isocode2)) +
geom_line() +
scale_color_manual(values = c("#1874CD", "#C28E42", "#00688B")) +
facet_wrap(~isocode2) +
#expand_limits(y = 0) +
guides(color = FALSE) +
theme_tq() +
labs(
title = "Time Series for the Countries (1950-2011)",
x = "",
y = "RTFP^NA",
caption = "github.com/LucianoBatista"
) +
theme(
axis.title.y = element_markdown()
)
About the series profiles:
Canada: the series looks stationary, oscillating around a roughly constant level across the years.
USA: a clear positive trend, with no seasonal pattern.
Mexico: a negative trend over the last 20 years, and likewise no clear seasonality.
The forecast package (and the other packages used here) expects a time series object, so the tibbles are converted below.
# ts_USA
pwt_usa_tbl <- pwt_raw %>%
filter(isocode == "USA") %>%
select(year, rtfpna)
# ts_MEX
pwt_mex_tbl <- pwt_raw %>%
filter(isocode == "MEX") %>%
select(year, rtfpna)
# ts_CAN
pwt_can_tbl <- pwt_raw %>%
filter(isocode == "CAN") %>%
select(year, rtfpna)
# first and the last time series year
start_point <- min(pwt_raw$year)
end_point <- max(pwt_raw$year)
# ts.obj
ts_USA <- ts(data = pwt_usa_tbl$rtfpna, # the series values
start = start_point, # the time of the first observation
end = end_point, # the time of the last observation
frequency = 1) # the series frequency
ts_MEX <- ts(data = pwt_mex_tbl$rtfpna, # the series values
start = start_point, # the time of the first observation
end = end_point, # the time of the last observation
frequency = 1) # the Series frequency
ts_CAN <- ts(data = pwt_can_tbl$rtfpna, # the series values
start = start_point, # the time of the first observation
end = end_point, # the time of the last observation
frequency = 1) # the series frequency
# to view the time series with interactive
ts_plot(ts_USA) # time series of interest
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
These are the objects that will be used for forecasting.
As shown before, none of the series seems to have visible seasonality, but it is better to check that with the stats::decompose() function.
To plot the decomposition, the function requires a series with two or more periods per cycle, while all the series here are annual (frequency 1). So the frequency parameter was changed to 2, purely for this visualization.
ts_USA_frq2 <- ts(data = pwt_usa_tbl$rtfpna,
start = start_point,
end = end_point,
frequency = 2) # changed
ts_MEX_frq2 <- ts(data = pwt_mex_tbl$rtfpna,
start = start_point,
end = end_point,
frequency = 2) # changed
ts_CAN_frq2 <- ts(data = pwt_can_tbl$rtfpna,
start = start_point,
end = end_point,
frequency = 2) # changed
plot(decompose(ts_USA_frq2))
plot(decompose(ts_MEX_frq2))
plot(decompose(ts_CAN_frq2))
All the seasonal components show very little variation on the y-axis, so it was assumed that none of the series has a seasonal component.
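A complementary check worth mentioning: forecast::ndiffs() estimates how many differences a series needs to become stationary, which can support the choice of d made in the next step. The sketch below uses a synthetic trending series as a stand-in for the real data, so the exact result is only illustrative:

```r
library(forecast)

# toy series: a mild upward trend plus noise, shaped like the USA series
set.seed(1)
toy_trend <- ts(0.6 + 0.01 * (1:62) + rnorm(62, sd = 0.01),
                start = 1950, frequency = 1)

# number of differences suggested by unit-root testing (KPSS by default)
d_needed <- ndiffs(toy_trend)
d_needed
```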
In this step, ARIMA models are used. These classic models are known for their robustness and are applicable to the most diverse kinds of time series.
The ARIMA hyperparameters will be tuned with respect to the AIC (Akaike Information Criterion), which balances goodness of fit against model complexity. The lower the value, the better.
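To make the criterion concrete, here is a minimal sketch comparing the AIC of a small and a larger ARIMA model on a toy random-walk series (the series and both orders are illustrative, not taken from the analysis above). The extra parameters of the larger model must "pay for themselves" in log-likelihood, or its AIC comes out worse:

```r
# toy series resembling a random walk, as ARIMA(0,1,1) would suit
set.seed(123)
toy <- ts(cumsum(rnorm(60, sd = 0.05)) + 1, start = 1950, frequency = 1)

md_small <- arima(toy, order = c(0, 1, 1))  # 1 parameter
md_big   <- arima(toy, order = c(2, 1, 2))  # 4 parameters

# AIC = 2k - 2*logLik; lower is better
c(AIC_small = md_small$aic, AIC_big = md_big$aic)
```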
# time series visualization
ts_plot(ts_USA)
# train and test
ts_USA_split <- ts_split(ts_USA, sample.out = 10)
train_usa <- ts_USA_split$train
test_usa <- ts_USA_split$test
# diagnostic with ACF and PACF
# ACF e PACF to check for autocorrelation
par(mfrow = c(1, 2))
acf(train_usa)
pacf(train_usa)
Each bar in the ACF plot represents the correlation between the series and one of its lags, in chronological order. The blue dotted lines mark the approximate bounds of statistical significance for each lag.
The PACF plot is similar; the difference is that each bar shows the correlation with a lag after removing the effect of the shorter lags in between.
Bars that fall inside the dotted lines indicate lags for which the null hypothesis of zero autocorrelation cannot be rejected.
For the USA, the ACF shows strong autocorrelation, while the PACF suggests that beyond the first lag the partial autocorrelations are mostly not significant.
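For reference, the blue dotted lines drawn by acf() and pacf() come from an approximate 95% band of ±1.96/√n around zero for white noise. A quick sketch, assuming the training length used here (62 years minus the 10-year test set):

```r
n <- 52                       # assumed training length: 62 - 10
band <- qnorm(0.975) / sqrt(n)  # half-width of the 95% significance band
round(band, 3)                # lags with |ACF| below this are not significant
```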
# Seed for reproducibility
set.seed(123)
# hyperparametrs
p <- q <- P <- Q <- 0:2
arima_grid <- expand.grid(p, q, P, Q)
names(arima_grid) <- c("p", "q", "P", "Q")
arima_grid$d <- 1
arima_grid$D <- 1
arima_grid$k <- rowSums(arima_grid)
# grid search table
arima_grid <- arima_grid %>% filter(k <= 7)
# iter over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  # fit the candidate model by maximum likelihood
  md <- arima(train_usa, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  # note the uppercase P for the seasonal AR order (it was duplicated as a
  # second lowercase p before, which printed as "p.1")
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)
head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1 0 1 0 -295.5225
## 2 0 1 0 0 1 1 -295.5225
## 3 0 1 1 1 1 0 -293.5345
## 4 1 1 1 0 1 0 -293.5345
## 5 1 1 0 0 1 1 -293.5345
## 6 0 1 0 1 1 1 -293.5345
best_model_ts_USA <- arima(train_usa, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))
best_model_test_fc <- forecast(best_model_ts_USA, h = 10)
accuracy(best_model_test_fc, test_usa)
## ME RMSE MAE MPE MAPE MASE
## Training set 0.0002214735 0.01160489 0.00938163 -0.01242196 1.237011 0.8166278
## Test set -0.0073651074 0.01684445 0.01244563 -0.73431150 1.244776 1.0833347
## ACF1 Theil's U
## Training set -0.002494444 NA
## Test set 0.731128842 1.481552
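As a possible sanity check on the grid search (not part of the original workflow), forecast::auto.arima() performs its own stepwise search over (p, d, q) and can be compared against the best grid result. Sketched on a toy series standing in for the USA training data:

```r
library(forecast)

# toy stand-in for the training series
set.seed(42)
toy <- ts(cumsum(rnorm(52, sd = 0.01)) + 0.8, start = 1950, frequency = 1)

# automatic order selection; arimaorder() extracts the chosen (p, d, q)
auto_md <- auto.arima(toy)
arimaorder(auto_md)
```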
test_forecast(ts_USA,
forecast.obj = best_model_test_fc,
test = test_usa)
# final step
final_md <- arima(ts_USA, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))
ts_USA_forecast <- forecast(final_md, h = 10)
plot_forecast(ts_USA_forecast)
# time series visualization
ts_plot(ts_MEX)
# train and test
ts_MEX_split <- ts_split(ts_MEX, sample.out = 10)
train_mex <- ts_MEX_split$train
test_mex <- ts_MEX_split$test
# diagnostic with ACF and PACF func
# ACF e PACF to check for autocorrelations
par(mfrow = c(1, 2))
acf(train_mex)
pacf(train_mex)
The Mexico series behaves similarly to the USA one, the difference being that at longer lags the correlations turn negative.
The PACF, however, still shows no apparent partial autocorrelation beyond the first lag.
# using the same grid search table as before
# iterate over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  # fit the candidate model by maximum likelihood
  md <- arima(train_mex, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  # uppercase P for the seasonal AR order
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)
head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1 0 1 0 -189.2815
## 2 0 1 0 0 1 1 -189.2815
## 3 0 1 2 0 1 0 -187.5147
## 4 0 1 0 0 1 2 -187.5147
## 5 1 1 1 0 1 0 -187.4391
## 6 0 1 1 1 1 0 -187.4391
best_model_ts_MEX <- arima(train_mex, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))
best_model_test_fc <- forecast(best_model_ts_MEX, h = 10)
accuracy(best_model_test_fc, test_mex)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.005401946 0.03393699 0.02729705 -0.4490029 2.391014 0.9854180
## Test set -0.020989575 0.03142131 0.02481903 -2.2047011 2.582690 0.8959623
## ACF1 Theil's U
## Training set 0.03406867 NA
## Test set 0.34116392 1.121388
test_forecast(ts_MEX,
forecast.obj = best_model_test_fc,
test = test_mex)
# final step
final_md <- arima(ts_MEX, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 0)))
ts_MEX_forecast <- forecast(final_md, h = 10)
plot_forecast(ts_MEX_forecast)
# time series visualization
ts_plot(ts_CAN)
# train and test
ts_CAN_split <- ts_split(ts_CAN, sample.out = 10)
train_can <- ts_CAN_split$train
test_can <- ts_CAN_split$test
# diagnostic with ACF and PACF func
# ACF e PACF to check for auto-correlations
par(mfrow = c(1, 2))
acf(train_can)
pacf(train_can)
The ACF plot for the Canada series drops below the blue dotted lines faster and appears to start with a negative autocorrelation.
In the PACF, all the bars stay within the blue dotted lines.
# using the same grid search table as before
# iterate over the grid search table
arima_search <- lapply(1:nrow(arima_grid), function(i) {
  # fit the candidate model by maximum likelihood
  md <- arima(train_can, order = c(arima_grid$p[i], 1, arima_grid$q[i]),
              seasonal = list(order = c(arima_grid$P[i], 1, arima_grid$Q[i])),
              method = "ML")
  # uppercase P for the seasonal AR order
  data.frame(p = arima_grid$p[i], d = 1, q = arima_grid$q[i],
             P = arima_grid$P[i], D = 1, Q = arima_grid$Q[i],
             AIC = md$aic)
}) %>% bind_rows() %>% arrange(AIC)
head(arima_search)
##   p d q P D Q       AIC
## 1 0 1 1 0 1 2 -249.1999
## 2 0 1 2 0 1 1 -249.1999
## 3 2 1 0 1 1 1 -248.7422
## 4 2 1 1 1 1 0 -248.7422
## 5 1 1 0 2 1 1 -248.7422
## 6 1 1 1 2 1 0 -248.7422
best_model_ts_CAN <- arima(train_can, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2)))
best_model_test_fc <- forecast(best_model_ts_CAN, h = 10)
accuracy(best_model_test_fc, test_can)
## ME RMSE MAE MPE MAPE MASE
## Training set -0.002632124 0.01771258 0.01319837 -0.2745477 1.344406 0.8581564
## Test set -0.084149958 0.09637938 0.08414996 -8.7974533 8.797453 5.4714214
## ACF1 Theil's U
## Training set -0.01801727 NA
## Test set 0.69306916 6.311877
test_forecast(ts_CAN,
forecast.obj = best_model_test_fc,
test = test_can)
# final step
final_md <- arima(ts_CAN, order = c(0, 1, 1), seasonal = list(order = c(0, 1, 2)))
ts_CAN_forecast <- forecast(final_md, h = 10)
plot_forecast(ts_CAN_forecast)
All the forecast values can be found in the objects ts_CAN_forecast, ts_MEX_forecast and ts_USA_forecast. Unfortunately, the confidence intervals widen considerably as the horizon moves further from the last observed year; collecting more data would help narrow them.
Among the three predictions, the most precise was the USA RTFP\(^{NA}\) forecast, which had the tightest confidence interval.
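To inspect or export those point forecasts and intervals, a forecast object can be coerced to a data frame. A self-contained sketch on a toy series (the real objects live in the session above):

```r
library(forecast)

# toy series and the same (0,1,1) fit used for the USA and Mexico
set.seed(7)
toy <- ts(cumsum(rnorm(62, sd = 0.02)) + 1, start = 1950, frequency = 1)
fc <- forecast(arima(toy, order = c(0, 1, 1)), h = 10)

# one row per forecast year: Point Forecast, Lo 80, Hi 80, Lo 95, Hi 95
fc_df <- as.data.frame(fc)
head(fc_df)
```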